68 research outputs found

    Position-Aware Contrastive Alignment for Referring Image Segmentation

    Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to accurately locate the referred object among all instances in the image. Currently, the most effective way to solve this problem is to obtain aligned multi-modal features by computing the correlation between visual and linguistic features under the supervision of the ground-truth mask. However, existing paradigms struggle to thoroughly understand visual and linguistic content because they cannot directly perceive information about the objects surrounding the target. This prevents them from learning aligned multi-modal features and leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) that enhances the alignment of multi-modal features by guiding the interaction between vision and language with prior position information. Our PCAN consists of two modules: 1) a Position-Aware Module (PAM), which provides the position information of all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by contrasting the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate that our PCAN performs favorably against state-of-the-art methods. Our code will be made publicly available. Comment: 12 pages, 6 figures
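    The contrastive comparison described in CLUM, pulling the language feature toward the referred object while pushing it away from related objects, can be sketched as an InfoNCE-style objective. This is an illustrative reconstruction, not the paper's actual loss; the function names, feature shapes, and temperature `tau` are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_alignment_loss(text_feat, referred_feat, related_feats, tau=0.1):
    """InfoNCE-style loss: treat the referred object's feature as the positive
    for the text feature and the related (distractor) objects as negatives."""
    pos = math.exp(cosine(text_feat, referred_feat) / tau)
    neg = sum(math.exp(cosine(text_feat, f) / tau) for f in related_feats)
    return -math.log(pos / (pos + neg))
```

    Minimizing this loss drives the aligned pair's similarity up relative to the distractors, which is the intended multi-modal alignment effect.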

    CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

    With the development of deep generative models, recent years have seen great success in Chinese landscape painting generation. However, few works focus on controllable Chinese landscape painting generation due to the lack of data and limited modeling capabilities. In this work, we propose a controllable Chinese landscape painting generation method named CCLAP, which generates paintings with specific content and style based on the Latent Diffusion Model. Specifically, it consists of two cascaded modules, i.e., a content generator and a style aggregator. The content generator module ensures that the content of generated paintings matches the input text, while the style aggregator module generates paintings in a style corresponding to a reference image. Moreover, a new dataset of Chinese landscape paintings named CLAP is collected for comprehensive evaluation. Both qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance, especially in artful composition and artistic conception. Code is available at https://github.com/Robin-WZQ/CCLAP. Comment: 8 pages, 13 figures

    Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

    Handwritten mathematical expression recognition (HMER) has attracted extensive attention recently. However, current methods cannot explicitly model the interactions between different symbols, and may fail when faced with similar symbols. To alleviate this issue, we propose a simple but efficient method to enhance semantic interaction learning (SIL). Specifically, we first construct a semantic graph based on statistical symbol co-occurrence probabilities. Then we design a semantic-aware module (SAM), which projects the visual and classification features into a semantic space. The cosine distance between different projected vectors indicates the correlation between symbols, and jointly optimizing HMER and SIL explicitly enhances the model's understanding of symbol relationships. In addition, SAM can be easily plugged into existing attention-based models for HMER and consistently brings improvements. Extensive experiments on public benchmark datasets demonstrate that our proposed module effectively enhances recognition performance. Our method achieves better recognition performance than prior arts on both the CROHME and HME100K datasets. Comment: 12 pages
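    The semantic graph construction step, estimating edges from statistical symbol co-occurrence, might look roughly like the sketch below. This is a hypothetical reconstruction; the paper's exact counting scheme and normalization are not specified in the abstract:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_probs(expressions):
    """Estimate the co-occurrence probability of each symbol pair as the
    fraction of training expressions in which both symbols appear."""
    pair_counts = Counter()
    for expr in expressions:
        # Count each unordered symbol pair at most once per expression.
        for a, b in combinations(sorted(set(expr)), 2):
            pair_counts[(a, b)] += 1
    n = len(expressions)
    return {pair: count / n for pair, count in pair_counts.items()}
```

    These pairwise probabilities would then serve as edge weights of the semantic graph that supervises the projected symbol embeddings.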

    ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

    In addition to their unprecedented ability in imaginative creation, large text-to-image models are expected to incorporate customized concepts into image generation. Existing works generally learn such concepts in an optimization-based manner, which brings an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, which consists of global and local mapping networks, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for the well-editable concept and auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to provide omitted details without sacrificing the editability of primary concepts. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE. Comment: Accepted by ICCV 2023, oral presentation. Code: https://github.com/csyxwei/ELITE

    Patch Is Not All You Need

    Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity. To handle this, we propose a novel Pattern Transformer (Patternformer) to adaptively convert images into pattern sequences for Transformer input. Specifically, we employ a Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Employing only a vanilla ResNet and Transformer, we achieve state-of-the-art performance on CIFAR-10 and CIFAR-100, and competitive results on ImageNet.
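    The channel-to-token conversion described above, where each CNN channel becomes one visual token instead of a spatial patch, can be sketched as follows. This is an illustrative simplification; the actual Patternformer tokenization and any projection layers are not detailed in the abstract:

```python
def channels_to_tokens(feature_map):
    """Turn a C x H x W feature map (nested lists) into a sequence of C
    visual tokens, one per channel, by flattening each channel's H*W grid.

    Unlike fixed patch partitioning, each token here covers a learned pattern
    spanning the whole image rather than a rigid square region."""
    return [[value for row in channel for value in row]
            for channel in feature_map]
```

    The resulting list of C tokens would then be fed to the Transformer in place of the usual patch embeddings.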

    Optimization of network structure to random failures

    A network's resilience to the malfunction of its components has been of great concern. The goal of this work is to determine network design guidelines that maximize network efficiency while keeping the cost of the network (that is, the average connectivity) constant. With a global optimization method, memory tabu search (MTS), we obtain the optimal network structure with approximately the best efficiency. We analyze the statistical characteristics of the network and find that a network with a small number of hub nodes and a high degree of clustering may be much more resilient to perturbations than a random network, and that the optimal network is a kind of highly heterogeneous network. The results strongly suggest that networks with higher efficiency are more robust to random failures. In addition, we propose a simple model to describe the statistical properties of the optimal network and investigate the synchronizability of this model. Comment: 11 pages, 6 figures, accepted by Physica
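    The efficiency being maximized is presumably the standard global network efficiency, the mean inverse shortest-path length over all ordered node pairs; a minimal sketch of that metric for an unweighted graph follows (the MTS search itself is omitted, and the adjacency-list representation is an assumption):

```python
from collections import deque

def network_efficiency(adj):
    """Global efficiency E = (1 / (N*(N-1))) * sum over ordered pairs (i, j)
    of 1/d(i, j), where d is the shortest-path hop count; unreachable pairs
    contribute 0. `adj` maps each node to a list of its neighbors."""
    nodes = list(adj)
    n = len(nodes)
    total = 0.0
    for src in nodes:
        # Breadth-first search gives shortest hop distances from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))
```

    Under a fixed average connectivity, the optimization would rewire edges to raise this quantity, which is why the resulting structures concentrate links on a few hub nodes.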

    Local Global Relational Network for Facial Action Units Recognition

    Many existing facial action unit (AU) recognition approaches enhance the AU representation by combining local features from multiple independent branches, each corresponding to a different AU. However, such multi-branch combination-based methods usually neglect the potential mutual-assistance and exclusion relationships between AU branches, or simply employ a pre-defined and fixed knowledge graph as a prior. In addition, extracting features from pre-defined AU regions of regular shapes limits the representation ability. In this paper, we propose a novel Local Global Relational Network (LGRNet) for facial AU recognition. LGRNet mainly consists of two novel structures: a skip-BiLSTM module, which models the latent mutual-assistance and exclusion relationships among local AU features from multiple branches to enhance feature robustness, and a feature fusion & refining module, which explores the complementarity between local AUs and the whole face in order to refine the local AU features and improve their discriminability. Experiments on the BP4D and DISFA AU datasets show that the proposed approach outperforms state-of-the-art methods by a large margin.

    Experimental Investigation on the DLC Film Coating Technology in Scroll Compressors of Automobile Air Conditioning

    The friction of the orbiting scroll leads to large power consumption and low energy efficiency in the scroll compressor. Common methods to solve this problem are costly and involve complex processes. Applying coating technology to the scroll compressor, in view of its special structure and operating principles, is a new subject. Given that the orbiting scroll is made of aluminum alloy, unbalanced magnetron sputtering was chosen to coat the orbiting scroll, and a Cr transition layer was applied to enhance bonding strength. Moreover, we performed an experiment to verify the feasibility of unbalanced magnetron sputtering for depositing a diamond-like carbon film in the scroll compressor. This article describes the experimental system components and the methods used to measure the film properties before and after the experiments. The results show that the diamond-like carbon film has a low friction coefficient and high bonding strength, which give it a good wear-reducing effect and an excellent self-lubricating property. Due to the thin film layer and high operating temperature, the film thickness should be increased to improve abrasion resistance. The refrigeration system with the scroll compressor coated with the diamond-like carbon film can satisfy the national standard conditions even with the film's low Vickers hardness, and its performance was improved at low speed. Therefore, unbalanced magnetron sputtering with an added Cr bond layer is a feasible and appropriate technology for coating diamond-like carbon films.